In this project, I conduct data analysis and visualization on different variables, including age, gender, marital status and country. Then I find that country is the variable I am most interested in (since there are so many countries in this data set and each country corresponds to various answers). I decide to study on the relaionships among country, original_hm and reflection_period. Finally, I discover that there could be some hidden patterns for the answers after spending time exploring them on different countries.

Part 1 - R Shiny

First, I work on the processed_moments data set with R Shiny to get a general idea about the answers of happy moments.

library(tidyverse)
library(tidytext)
library(DT)
library(scales)
library(wordcloud2)
library(gridExtra)
library(ngram)
library(shiny) 
library(ggplot2)
library(reshape2)
library(lemon)

Load the processed text data along with demographic information on contributors

We use the processed data for our analysis and combine it with the demographic information available.

hm_data <- read_csv("../output/processed_moments.csv")

urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'
demo_data <- read_csv(urlfile)

Combine both the data sets and keep the required columns for analysis

hm_data <- hm_data %>%
  inner_join(demo_data, by = "wid") %>%
  select(wid,
         original_hm,
         gender, 
         marital, 
         parenthood,
         reflection_period,
         age, 
         country, 
         ground_truth_category, 
         text) %>%
  mutate(count = sapply(hm_data$text, wordcount)) %>%
  filter(gender %in% c("m", "f")) %>%
  filter(marital %in% c("single", "married")) %>%
  filter(parenthood %in% c("n", "y")) %>%
  filter(reflection_period %in% c("24h", "3m")) %>%
  mutate(reflection_period = fct_recode(reflection_period, 
                                        months_3 = "3m", hours_24 = "24h"))
datatable(hm_data)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Create a bag of words using the text data

bag_of_words <-  hm_data %>%
  unnest_tokens(word, text)

word_count <- bag_of_words %>%
  count(word, sort = TRUE)
ui <- navbarPage("What makes people happy?",
                 tabPanel("Overview",
                          
                          titlePanel(h1("Most Frequent Occurrences",
                                        align = "center")),
                          
                          sidebarLayout(
                            sidebarPanel(
                              sliderInput(inputId = "topWordcloud",
                                          label = "Number of terms for word cloud:",
                                          min = 5,
                                          max = 100,
                                          value = 50)
                            ),
                            
                            mainPanel(
                              wordcloud2Output(outputId = "WC")
                            )
                          )
                 )
)

Develop the server for the R Shiny app

server <- function(input, output, session) {
  
  output$WC <- renderWordcloud2({
    
    word_count %>%
      slice(1:input$topWordcloud) %>%
      wordcloud2(size = 0.6,
                 rotateRatio = 0)
    
  })
}

Part 2 - Sentiment analysis

After getting the most frequently occurring words, I would like to study the variables further to get more sufficient analysis on the answers.

Look into some variables

I discover that great amount of people are from USA and India, and there are 99 countries in total.

mycountry<-as.data.frame(table(hm_data$country))
mycountryupdated<-mycountry[order(mycountry$Freq, decreasing = T),]
mycountrytop10<-mycountryupdated[1:10,]

colnames(mycountrytop10) <- c("Country", "Count")
Countrylist<-c("USA", "India", "Venezuela", "Canada", "UK", "Philippines", "Vietnam", "Brazil", 
                         "Mexico", "Australia")
cbind(Countrylist,mycountrytop10)

There are more male being asked than famale, and more single being asked than married. The median and mean of age are 30 and 31.83 if I omit NA value and values larger than or equal to 120.

table(hm_data$gender)
## 
##     f     m 
## 38734 55840
table(hm_data$marital)
## 
## married  single 
##   40974   53600
agesub<-na.omit(as.numeric(hm_data$age[which(as.numeric(hm_data$age)<120)]))
## Warning in which(as.numeric(hm_data$age) < 120): NAs introduced by coercion
hist(agesub, main = 'Histogram of Age', xlab='Age',col="pink")

median(agesub)
## [1] 30
mean(agesub)
## [1] 31.83322

Explore text answers

In order to find patterns easily, I decide to study top 10 countries at this time because all the countries in top 10 has more than 100 answers.

mytext<-as.data.frame(hm_data)[hm_data$country %in% mycountrytop10$Country,]
textcount<-nchar(mytext$original_hm)
textcountry<-mytext$country
textdataframe<-data.frame(textcountry,textcount)

ggplot(data = textdataframe, aes(x = textcount)) +
geom_histogram(color="darkblue", fill="lightblue") +
facet_wrap(~ textcountry, scales="free",nrow = 4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on the shapes of above histograms of top 10 countries, I find that different countries have different average characers’ lengths for the answers, so I do more work on this.

mean_of_text<-setNames(aggregate(textdataframe[, 2], list(textdataframe$textcountry), mean),c("Countries","Mean"))
mean_of_text
ggplot(mean_of_text, aes(x=Countries, y=Mean, color=Countries)) +
geom_point()

quantile_of_text<-aggregate(textdataframe[, 2], list(textdataframe$textcountry), quantile)
colnames(quantile_of_text)<-c("Countries","length")
quantile_of_text$length<-round(quantile_of_text$length)
quantile_of_text
s<-
"Countries length.0% length.25% length.50% length.75% length.100%
     AUS        13         41         56         93         297
     BRA        19         38         50         73         199
     CAN        16         42         60         86         497
     GBR        15         42         70        101        1233
     IND        11         41         79        151        6550
     MEX        15         43         60         90         267
     PHL        16         44         62         92         427
     USA         6         46         67         98        6557
     VEN        11         49         74        111         766
     VNM        15         31         46         63         336
"

d1 <- read.delim(textConnection(s), sep="")
d1 <- melt(d1, id.vars="Countries")

ggplot(d1, aes(Countries,value, col=variable)) + 
  geom_point() + 
  stat_smooth()
## `geom_smooth()` using method = 'loess'

As we can see from the above graph, USA and India have some extreme values for characters’ lengths. But since the data range is too large to make further analysis, I decide to look at data for each quantile seperately.

d2<- read.table(text = s, header = T)
d2 %>% tidyr::gather("id", "value", 2:6) %>% 
  ggplot(., aes(Countries, value))+
  geom_point( color="purple")+
  geom_smooth(method = "lm", se=FALSE)+
  facet_wrap(~id,scales="free", nrow=3)

India, Venezuela, USA and UK seem to have the highest mean and median caompare to other top 10 countries. I decide to study more on these four countries in order to get more valuable results.

I select 3 original_hms that have the most charactors in the response sentences for each of the four countries (India, Venezuela, USA and UK), and I would like to see what these sentences are.

four_countries<-as.data.frame(hm_data)[hm_data$country %in% c('USA','IND','VEN','GBR'),]

split<-split(four_countries, four_countries$country)
list_fourcountries<-lapply(split, function(y) 
 output<-list(max(nchar(y$original_hm)),
              max(nchar(y$original_hm)[nchar(y$original_hm)!=max(nchar(y$original_hm))]),
              max(nchar(y$original_hm)[nchar(y$original_hm)!=max(nchar(y$original_hm))]
                  [nchar(y$original_hm)[nchar(y$original_hm)!=max(nchar(y$original_hm))]
                    !=max(nchar(y$original_hm)[nchar(y$original_hm)!=max(nchar(y$original_hm))])])
              )
)

list_fourcountries<-as.data.frame(list_fourcountries)
colnames(list_fourcountries)<-c('GBR1','GBR2','GBR3','IND1','IND2','IND3',
                                'USA1','USA2','USA3','VEN1','VEN2','VEN3')
list_fourcountries<-data.frame(t(list_fourcountries))
list_fourcountries<-cbind(countries = rownames(list_fourcountries), list_fourcountries) 
rownames(list_fourcountries)<-c()
colnames(list_fourcountries)<-c('Country_rank_max','Value')
list_fourcountries
ggplot(data=list_fourcountries, aes(x = Country_rank_max, y= Value, fill = Country_rank_max)) +
geom_bar(stat="identity")+
scale_colour_gradient2()+
coord_flip()

The numbers right after countries’ names are different ranks of answers for different countries. For example, VEN1, VEN2 and VEN3 means “the longest answer for Venezuela”, “the second longest answer for Venezuela” and “the third longest answer for Venezuela”.

GBR1<-four_countries[which(nchar(four_countries$original_hm)==1233 
                           & four_countries$country=='GBR'),]
GBR2<-four_countries[which(nchar(four_countries$original_hm)==410 
                           & four_countries$country=='GBR'),]
GBR3<-four_countries[which(nchar(four_countries$original_hm)==318 
                           & four_countries$country=='GBR'),]

IND1<-four_countries[which(nchar(four_countries$original_hm)==6550 
                           & four_countries$country=='IND'),]
IND2<-four_countries[which(nchar(four_countries$original_hm)==6532 
                           & four_countries$country=='IND'),]
IND3<-four_countries[which(nchar(four_countries$original_hm)==5348 
                           & four_countries$country=='IND'),]

USA1<-four_countries[which(nchar(four_countries$original_hm)==6557 
                           & four_countries$country=='USA'),]
USA2<-four_countries[which(nchar(four_countries$original_hm)==2882 
                           & four_countries$country=='USA'),]
USA3<-four_countries[which(nchar(four_countries$original_hm)==2274 
                           & four_countries$country=='USA'),]

VEN1<-four_countries[which(nchar(four_countries$original_hm)==766 
                           & four_countries$country=='VEN'),]
VEN2<-four_countries[which(nchar(four_countries$original_hm)==727 
                           & four_countries$country=='VEN'),]
VEN3<-four_countries[which(nchar(four_countries$original_hm)==606 
                           & four_countries$country=='VEN'),]

sentence_fourcountries<-rbind(GBR1, GBR2, GBR3, IND1, IND2, IND3, USA1, USA2, USA3, VEN1, VEN2, VEN3)
head(sentence_fourcountries, 2)

UK: I found that for the selected three long answers in UK, all the the reflection periods are three months. So people trend to describe a story of their lifes, or a feeling caused by some events that can last for long time, such as “a very good traning program” and “spring has arrived”.

India: However, the reflection period is 24 hours for the longest answer in India. The answer is about a special moment on “most beautiful symbol of eternal love” when the author visited Taj Mahal. The second longest answer is about a trip to Velankanni. The author talked about how he planed the trip before travlling and some memorable experiences during the trip. The third longest answer is about what can make us happy in general, not a specific event or moment. The answer is more like a study report including conclusions like “exercise more, 7 minutes might be enough for happiness”, “sleep more can make you less sensitive to negative emotions” and “a short commute is worth more than a big house if we move closer to work”.

USA: The reflection period is 24 hours for the longest answer in USA as well. This seems to be the same story with the longest answer in India, but other information of respondents are quite different. I doubt whether there are some mistake when people collected these answers originally. The second longest answer is about a summer vacation from Delhi to Ooty. The reflection period is 3 months. The third longest answer is about an experience of “helping a blind man to cross the street”. The reflection period is also 3 months, but the author talked about a less than one-day unforgettable expeience.

Venezuela: The longest answer is a moment about coming across a girl on the street. The reflection period is 24 hours. The second longest answer is about a moment of watching a son falling asleep by a mom. The reflection period is 24 hours. The third longest answer is about a moment of seeing blue sky and green trees on a soft moring. The author also mentioned that she lives in a country that has difficult times. The reflection period is 24 hours.

Conclusion: For the very long answers, the reflection periods tend to be 3 months. However, the reflection periods for the long answers in Venezuela are all 24 hours. This may be because the longest answers in Venezuela are much shorter than the longest answers in the other three countries I research on.

I would like to make some further conclusions, but let me do more research on the top 3 shortest answers for the same four countries before writing them down.

list_fourcountries2<-lapply(split, function(y) 
 output2<-list(min(nchar(y$original_hm)),
              min(nchar(y$original_hm)[nchar(y$original_hm)!=min(nchar(y$original_hm))]),
              min(nchar(y$original_hm)[nchar(y$original_hm)!=min(nchar(y$original_hm))]
                  [nchar(y$original_hm)[nchar(y$original_hm)!=min(nchar(y$original_hm))]
                    !=min(nchar(y$original_hm)[nchar(y$original_hm)!=min(nchar(y$original_hm))])])
              )
)

list_fourcountries2<-as.data.frame(list_fourcountries2)
colnames(list_fourcountries2)<-c('GBR1','GBR2','GBR3','IND1','IND2','IND3',
                                'USA1','USA2','USA3','VEN1','VEN2','VEN3')
list_fourcountries2<-data.frame(t(list_fourcountries2))
list_fourcountries2<-cbind(countries = rownames(list_fourcountries2), list_fourcountries2) 
rownames(list_fourcountries2)<-c()
colnames(list_fourcountries2)<-c('Country_rank_min','Value')
list_fourcountries2
ggplot(data=list_fourcountries2, aes(x = Country_rank_min, y= Value, fill = Country_rank_min)) +
geom_bar(stat="identity")+
scale_colour_gradient2()+
coord_flip()

GBR1_2<-four_countries[which(nchar(four_countries$original_hm)==15
                           & four_countries$country=='GBR'),]
GBR2_2<-four_countries[which(nchar(four_countries$original_hm)==19
                           & four_countries$country=='GBR'),]
GBR3_2<-four_countries[which(nchar(four_countries$original_hm)==21
                           & four_countries$country=='GBR'),]

IND1_2<-four_countries[which(nchar(four_countries$original_hm)==11
                           & four_countries$country=='IND'),]
IND2_2<-four_countries[which(nchar(four_countries$original_hm)==12 
                           & four_countries$country=='IND'),]
IND3_2<-four_countries[which(nchar(four_countries$original_hm)==13
                           & four_countries$country=='IND'),]

USA1_2<-four_countries[which(nchar(four_countries$original_hm)==6
                           & four_countries$country=='USA'),]
USA2_2<-four_countries[which(nchar(four_countries$original_hm)==7
                           & four_countries$country=='USA'),]
USA3_2<-four_countries[which(nchar(four_countries$original_hm)==8
                           & four_countries$country=='USA'),]

VEN1_2<-four_countries[which(nchar(four_countries$original_hm)==11
                           & four_countries$country=='VEN'),]
VEN2_2<-four_countries[which(nchar(four_countries$original_hm)==14
                           & four_countries$country=='VEN'),]
VEN3_2<-four_countries[which(nchar(four_countries$original_hm)==15
                           & four_countries$country=='VEN'),]

sentence_fourcountries2<-rbind(GBR1_2,GBR2_2,GBR3_2,IND1_2,IND2_2,IND3_2,USA1_2,USA2_2,USA3_2,VEN1_2,VEN2_2,VEN3_2)
head(sentence_fourcountries2, 2)
nrow(sentence_fourcountries2)
## [1] 47
split2<-split(sentence_fourcountries2, sentence_fourcountries2$reflection_period)
number_of_24_hours<-as.numeric(rapply(split2, length, how = "list")[[1]][1])
number_of_3_months<-as.numeric(rapply(split2, length, how = "list")[[2]][1])
reflection_time<-data.frame(number_of_24_hours, number_of_3_months)
reflection_time

There are 47 shortest answers (not 12) generated by my code, which means some shortest answers have the same length. For the short answers, people tend to describe one quick moment they feel happy, such as “went movie”, “won 50$”, “played soccer”, “had a shower”, “ate Mexican food” and “buy a car”. They don’t usually describe a whole story or a travel journey from their life experiences as they do in longer answers. The number of reflection periods of 24 hours is 33, which is much more than 14, the number of reflection periods of 3 months.

Furthermore, I think the answers of developed countries may be a little different from the answers of less developed countries. People in less developed countries tend to get happy from smaller events than people in developed countries. I discovered this from both top 3 longest answers analysis and top 3 shortest answers analysis. In top 3 longest answers, there is a women from Venezuela says she could feel a sense of placidity and happiness from a soft morning, although she lives in a country that has difficult times. And the other two answers are also not talking about a long-lasting or big event. In top 3 shortest answers, people from Venezuela tend to talked about moments that are more commonly happened in our lifes, even if all the shortest answers tend to talk about smaller events compared with the longest answers.

However, I am not making further inference based on these results because the total numbers of respondents from different countries are quite different. It is not fair to only compare the numbers along. In addition, there could be some other reasons that are not directly related to this study, but still have influences on the study.

Run the R Shiny app

shinyApp(ui, server)
Shiny applications not supported in static R Markdown documents